データの全体像¶
train:1,482,535個のユーザーが投稿した商品
test:693,359行で「価格(Price)」の項目がテストデータは含まれていないため、列数は「7」となっている
- train_id / test _id – ユーザー投稿のID
- name – 投稿のタイトル。タイトルに価格に関する情報がある場合(例:$20)はメルカリが事前に削除をして[rm]と置き換えている。
- item_condition_id – ユーザーが指定した商品の状態 (1~5,大きい方が状態が良い)
- category_name – 投稿カテゴリー
- brand_name – ブランドの名前
- price – 訓練データのみ。実際に売られた価格。米ドル表示。今回のチャレンジの予測ターゲットとなる。
- shipping – 送料のフラグ。「1」は販売者負担。「0」は購入者負担。
- item_description – ユーザーが投稿した商品説明の全文。タイトルと同様に価格情報がある場合は[rm]と置き換えられている。
ライブラリのインポート¶
In [1]:
import re
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
%matplotlib inline
データの読み込み¶
In [2]:
%%time
# Kaggle Notebook
train = pd.read_csv("/kaggle/input/mercari-dataset/train.tsv", delimiter='\t')
test = pd.read_csv("/kaggle/input/mercari-dataset/test.tsv", delimiter='\t')
# Local
# train = pd.read_csv("train.tsv", delimiter='\t')
# test = pd.read_csv("test.tsv", delimiter='\t')
CPU times: user 8.56 s, sys: 728 ms, total: 9.29 s Wall time: 9.3 s
In [3]:
print(train.shape)
print(train.info())
display(train.head())
(1482535, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1482535 entries, 0 to 1482534 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 train_id 1482535 non-null int64 1 name 1482535 non-null object 2 item_condition_id 1482535 non-null int64 3 category_name 1476208 non-null object 4 brand_name 849853 non-null object 5 price 1482535 non-null float64 6 shipping 1482535 non-null int64 7 item_description 1482529 non-null object dtypes: float64(1), int64(3), object(4) memory usage: 90.5+ MB None
| train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |
In [4]:
print(test.shape)
print(test.info())
display(test.head())
(693359, 7) <class 'pandas.core.frame.DataFrame'> RangeIndex: 693359 entries, 0 to 693358 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 test_id 693359 non-null int64 1 name 693359 non-null object 2 item_condition_id 693359 non-null int64 3 category_name 690301 non-null object 4 brand_name 397834 non-null object 5 shipping 693359 non-null int64 6 item_description 693359 non-null object dtypes: int64(3), object(4) memory usage: 37.0+ MB None
| test_id | name | item_condition_id | category_name | brand_name | shipping | item_description | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Breast cancer "I fight like a girl" ring | 1 | Women/Jewelry/Rings | NaN | 1 | Size 7 |
| 1 | 1 | 25 pcs NEW 7.5"x12" Kraft Bubble Mailers | 1 | Other/Office supplies/Shipping Supplies | NaN | 1 | 25 pcs NEW 7.5"x12" Kraft Bubble Mailers Lined... |
| 2 | 2 | Coach bag | 1 | Vintage & Collectibles/Bags and Purses/Handbag | Coach | 1 | Brand new coach bag. Bought for [rm] at a Coac... |
| 3 | 3 | Floral Kimono | 2 | Women/Sweaters/Cardigan | NaN | 0 | -floral kimono -never worn -lightweight and pe... |
| 4 | 4 | Life after Death | 3 | Other/Books/Religion & Spirituality | NaN | 1 | Rediscovering life after the loss of a loved o... |
In [5]:
# データ数を取得
data_counts = [len(train), len(test)]
labels = ['Train Data', 'Test Data']
# 円グラフを作成
plt.figure(figsize=(8, 6))
# カスタムラベルを作成
def func(pct, allvalues):
absolute = int(pct / 100. * sum(allvalues))
return f'{pct:.1f}%\n({absolute})'
plt.pie(data_counts, labels=labels, autopct=lambda pct: func(pct, data_counts), startangle=90)
plt.title('Train and Test Data Counts')
plt.axis('equal') # 円を真円にする
plt.show()
欠損値¶
In [6]:
print(train.isnull().sum())
train_id 0 name 0 item_condition_id 0 category_name 6327 brand_name 632682 price 0 shipping 0 item_description 6 dtype: int64
In [7]:
print(test.isnull().sum())
test_id 0 name 0 item_condition_id 0 category_name 3058 brand_name 295525 shipping 0 item_description 0 dtype: int64
In [8]:
# 欠損値の個数を計算
missing_data_train = {
'category_name': train['category_name'].isnull().sum(),
'brand_name': train['brand_name'].isnull().sum(),
'item_description': train['item_description'].isnull().sum()
}
total_rows = len(train)
# 各列の欠損値の割合を計算
missing_percentage = {k: (v / total_rows) * 100 for k, v in missing_data_train.items()}
# 横並びのサブプロットを作成
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# 欠損値をプロットする関数
def plot_missing_data(ax, column_name, missing_count, missing_percent):
labels = ['Missing', 'Not Missing']
sizes = [missing_count, total_rows - missing_count]
explode = (0.1, 0) # "Missing"のスライスを少し外側に表示
ax.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)
ax.set_title(f"{column_name} Missing Values\nCount: {missing_count} ({missing_percent:.2f}%)")
ax.axis('equal') # 円を真円にする
# 各列ごとに欠損値をプロット
for ax, column in zip(axes, missing_data_train.keys()):
plot_missing_data(ax, column, missing_data_train[column], missing_percentage[column])
plt.tight_layout() # レイアウトを調整して重ならないようにする
plt.show()
In [9]:
# 欠損値の個数を計算
missing_data_test = {
'category_name': test['category_name'].isnull().sum(),
'brand_name': test['brand_name'].isnull().sum()
}
total_rows_test = len(test)
# 各列の欠損値の割合を計算
missing_percentage_test = {k: (v / total_rows_test) * 100 for k, v in missing_data_test.items()}
# 横並びのサブプロットを作成
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
# 欠損値をプロットする関数
def plot_missing_data(ax, column_name, missing_count, missing_percent):
labels = ['Missing', 'Not Missing']
sizes = [missing_count, total_rows_test - missing_count]
explode = (0.1, 0) # "Missing"のスライスを少し外側に表示
ax.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)
ax.set_title(f"{column_name} Missing Values\nCount: {missing_count} ({missing_percent:.2f}%)")
ax.axis('equal') # 円を真円にする
# 各列ごとに欠損値をプロット
for ax, column in zip(axes, missing_data_test.keys()):
plot_missing_data(ax, column, missing_data_test[column], missing_percentage_test[column])
plt.tight_layout() # レイアウトを調整して重ならないようにする
plt.show()
'price'¶
In [10]:
train['price'].describe()
Out[10]:
count 1.482535e+06 mean 2.673752e+01 std 3.858607e+01 min 0.000000e+00 25% 1.000000e+01 50% 1.700000e+01 75% 2.900000e+01 max 2.009000e+03 Name: price, dtype: float64
In [11]:
plt.subplot(1, 2, 1)
(train['price']).plot.hist(bins=50, figsize=(20,10), edgecolor='white',range=[0,250])
plt.xlabel('price+', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Price Distribution', fontsize=17)
plt.subplot(1, 2, 2)
np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
plt.xlabel('log (price+1)', fontsize=17)
plt.ylabel('frequency', fontsize=17)
plt.tick_params(labelsize=15)
plt.title('Log (Price) Distribution', fontsize=17)
plt.show()
In [12]:
plt.figure(figsize=(10, 6))
sns.boxplot(x=train['price'], color='salmon')
plt.title('Box Plot of Prices')
plt.xlabel('Price')
plt.grid()
plt.show()
'name'¶
In [13]:
# 商品名の長さを計算
train['name_length'] = train['name'].str.len()
# ヒストグラムを作成
plt.figure(figsize=(10, 6))
plt.hist(train['name_length'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of Product Name Lengths')
plt.xlabel('Length of Product Name')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()
In [14]:
%%time
# 商品名を結合
text = ' '.join(train['name'])
# ワードクラウドを生成
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
# ワードクラウドを表示
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Product Names')
plt.show()
CPU times: user 34.5 s, sys: 1.86 s, total: 36.4 s Wall time: 36.1 s
In [15]:
# 商品名の長さと価格の関係
plt.figure(figsize=(10, 6))
plt.scatter(train['name_length'], train['price'], alpha=0.5)
plt.title('Product Name Length vs Price')
plt.xlabel('Length of Product Name')
plt.ylabel('Price')
plt.grid()
plt.show()
'item_conditon_id'¶
In [16]:
# item_condition_id のカウントを取得
condition_counts = train['item_condition_id'].value_counts().sort_index()
# 棒グラフを作成
plt.figure(figsize=(8, 6))
condition_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Products by Item Condition')
plt.xlabel('Item Condition ID')
plt.ylabel('Number of Products')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()
'category_name'¶
In [17]:
# TOP 5 RAW CATEGORIES
train['category_name'].value_counts()[:5]
Out[17]:
category_name Women/Athletic Apparel/Pants, Tights, Leggings 60177 Women/Tops & Blouses/T-Shirts 46380 Beauty/Makeup/Face 34335 Beauty/Makeup/Lips 29910 Electronics/Video Games & Consoles/Games 26557 Name: count, dtype: int64
In [18]:
def split_cat(text):
try: return text.split("/")
except: return ("No Label", "No Label", "No Label")
In [19]:
%%time
train['general_cat'], train['subcat_1'], train['subcat_2'] = \
zip(*train['category_name'].apply(lambda x: split_cat(x)))
train.head()
CPU times: user 5.45 s, sys: 259 ms, total: 5.71 s Wall time: 5.71 s
Out[19]:
| train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | name_length | general_cat | subcat_1 | subcat_2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet | 35 | Men | Tops | T-shirts |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... | 32 | Electronics | Computers & Tablets | Components & Parts |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... | 14 | Women | Tops & Blouses | Blouse |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... | 21 | Home | Home Décor | Home Décor Accents |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity | 20 | Women | Jewelry | Necklaces |
In [20]:
x = train['general_cat'].value_counts().index.values.astype('str')
y = train['general_cat'].value_counts().values
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))]
In [21]:
trace1 = go.Bar(x=x, y=y, text=pct)
layout = dict(title= 'Number of Items by Main Category',
yaxis = dict(title='Count'),
xaxis = dict(title='Category'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)
In [22]:
x = train['subcat_1'].value_counts().index.values.astype('str')[:15]
y = train['subcat_1'].value_counts().values[:15]
pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))][:15]
In [23]:
trace1 = go.Bar(x=x, y=y, text=pct,
marker=dict(
color = y,colorscale='Portland',showscale=True,
reversescale = False
))
layout = dict(title= 'Number of Items by Sub Category (Top 15)',
yaxis = dict(title='Count'),
xaxis = dict(title='SubCategory'))
fig=dict(data=[trace1], layout=layout)
py.iplot(fig)
価格設定(価格の対数)の観点から見ると、すべてのカテゴリがかなり適切に分散されており、特別な価格設定のカテゴリはない。
In [24]:
general_cats = train['general_cat'].unique()
x = [train.loc[train['general_cat']==cat, 'price'] for cat in general_cats]
In [25]:
data = [go.Box(x=np.log(x[i]+1), name=general_cats[i]) for i in range(len(general_cats))]
In [26]:
layout = dict(title="Price Distribution by General Category",
yaxis = dict(title='Category'),
xaxis = dict(title='Frequency'))
fig = dict(data=data, layout=layout)
py.iplot(fig)